# FPGA systolic-based Architecture for Video Applications in Real Time

Griselda Saldaña

Computer Science Department National Institute for Astrophysics, Optics and Electronics (INAOE) Sta. Maria Tonantzintla, Puebla. Mexico gsaldan@inaoep.mx

#### Abstract

Motion estimation accounts for a significant share of the computation in video compression standards such as MPEG-4. The most frequently used technique is the Full Search Block Matching Algorithm, which is computationally intensive and requires a large memory bandwidth to reach real-time performance. This paper describes an efficient reconfigurable architecture suitable for motion estimation.

### 1. Motivation

Motion estimation (ME) is a basic bandwidth compression method used in video-coding systems. Among the available computation methods, the Full Search Block Matching Algorithm (FBMA) is the most widely used.

ME requires a very large number of computations, which justifies the considerable research effort devoted to efficient dedicated architectures and specialized processors. The FBMA is extremely regular and therefore well suited to implementations based on array structures.
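The regularity of the FBMA can be seen in a minimal software sketch. The matching criterion (sum of absolute differences), the function names `sad` and `full_search`, the block size `n`, and the search range `p` are assumptions made for illustration; the text does not specify them. Every candidate displacement runs the same fixed loop nest, which is what makes the algorithm map well onto an array of identical processing elements.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Test every displacement in [-p, p] x [-p, p] and return
    (best_dx, best_dy, best_sad). Illustrative helper, not the paper's code."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width
                    and 0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Synthetic test frames: `cur` is `ref` shifted right by 1 and down by 2,
# so the matching block in `ref` lies at displacement (-1, -2).
ref = [[(7 * x + 13 * y) % 251 for x in range(16)] for y in range(16)]
cur = [[ref[(y - 2) % 16][(x - 1) % 16] for x in range(16)] for y in range(16)]
motion = full_search(cur, ref, bx=4, by=4, n=4, p=3)
print(motion)  # (-1, -2, 0)
```

The cost of one block is `(2p + 1)^2 * n^2` absolute differences, independent of the image content, which is the source of both the high computational load and the regular data flow.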

Faster block-matching algorithms have also been proposed; however, most of them consider only a reduced set of candidate motion vectors, use simpler matching or distortion computations, or estimate only a subset of the block motion field. These algorithms provide suboptimal solutions, since the search space they consider is necessarily reduced, and most of them apply non-regular processing schemes.

### 2. Previous Work

Several methods have been proposed for ME hardware implementation [1-3], such as block matching algorithms, parametric/motion models, optical flow, and pel-recursive techniques. Among these approaches, block matching is the most common.

These architectures make use of the massive pipelining and parallel processing provided by systolic [4] or linear arrays. Most of them require two separate memories to store the current frame and the previous frame, which increases their size; hence, efficient memory utilization becomes one of the most important design problems. Furthermore, they are not intrinsically power efficient.

### 3. Contributions

This work proposes an integrated platform that implements several algorithms based on window operators in a single processing module, with the aim of supporting higher-complexity algorithms such as ME.

The system is based on a customizable 2D systolic architecture and a smart memory scheme that reduces the number of accesses to global memory, which increases the overall system clock frequency. Furthermore, the system supports process chaining: local storage buffers reduce the number of accesses to data memories, and router elements handle data movement among the different structures inside the architecture.

### 4. Preliminary results

The proposed architecture based on a 2D processor array is shown in Fig. 1.

The global bus receives processing parameters from the high-level control unit and distributes them inside the architecture, carrying control and configuration information in both directions.

Input buffers hold several rows of the image being processed, which supply the neighboring elements required by FBMA. These data can be accessed in parallel, which reduces access time and makes it possible to carry out computations on local data.
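The role of the input buffers can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `stream_windows`, the use of a `deque` as the row store, and the row-by-row streaming order are illustrative choices, not details taken from the architecture. Each image row is read once from the "global" source, and every window is then assembled from the locally stored rows.

```python
from collections import deque


def stream_windows(rows, k):
    """Yield (row, col, window) for every k x k window of an image that
    arrives row by row. Only the k most recent rows are kept, mimicking
    a bank of row buffers (hypothetical model, not the paper's design)."""
    buffers = deque(maxlen=k)      # the oldest row is dropped automatically
    for r, row in enumerate(rows):
        buffers.append(list(row))  # each pixel is fetched once from the source
        if len(buffers) < k:
            continue               # still filling the buffers
        for c in range(len(row) - k + 1):
            # All k rows are available at once, so the window is built
            # from local storage only.
            window = [buf[c:c + k] for buf in buffers]
            yield r - k + 1, c, window


image = [[10 * y + x for x in range(5)] for y in range(4)]
windows = list(stream_windows(image, 3))
print(len(windows))  # 6
print(windows[0])    # (0, 0, [[0, 1, 2], [10, 11, 12], [20, 21, 22]])
```

Without the buffers, each of the `k * k` window elements would have to be fetched again for every output position; with them, each pixel crosses the global memory interface once.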



Figure 1. Block diagram of the architecture.

The architecture has been implemented in Handel-C using DK4, synthesized for an XCV2000E-6 Virtex-E FPGA with the Xilinx Synthesis Technology (XST) tool, and placed and routed with Foundation ISE 7. Synthesis results are shown in Table 1.

Table 1. Technical data for the architecture

| Element                      | Specification                     |
|------------------------------|-----------------------------------|
| FPGA technology              | 0.18 µm 6-layer metal process     |
| Number of PEs                | 49                                |
| Off-chip memory data buses   | 21-bit address, 32-bit data       |
| Internal data buses for ALUs | 8 bits for fixed-point operations |
| Number of Block RAMs         | 18                                |
| Number of Slices             | 14,728                            |
| Number of 4-input LUTs       | 28,239                            |
| Number of Flip Flops         | 8,348                             |
| Estimated power consumption  | 1.492 W                           |
| Clock frequency              | 55 MHz                            |
| Peak performance             | ~9 GOPs                           |

The clock frequency reported by the synthesis tool is 55 MHz, at which a peak performance of ~9 GOPs is achieved. To test the architecture, 640x480 gray-level images and 7x7 masks have been used. The main objective of the architecture is to support several algorithms based on window processing; therefore, as a first step, filtering, matrix multiplication, morphologic operators and a Gaussian pyramid have been implemented, as shown in Figure 2.
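The following sketch suggests why such different algorithms can share one processing module. It assumes a window operator can be written as a per-element `combine` step followed by a `reduce_fn` step over the window; this decomposition, the function name `window_operator`, and the flat (all-ones) mask used for the morphologic operators are illustrative assumptions, and a 3x3 mask is used instead of 7x7 to keep the example short.

```python
def window_operator(image, mask, combine, reduce_fn):
    """Apply a k x k window operator to every valid position of `image`.
    `combine` pairs a pixel with a mask element; `reduce_fn` collapses the
    k*k partial results into one output value (hypothetical decomposition)."""
    k = len(mask)
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - k + 1):
        out_row = []
        for x in range(w - k + 1):
            terms = [combine(image[y + j][x + i], mask[j][i])
                     for j in range(k) for i in range(k)]
            out_row.append(reduce_fn(terms))
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
ones = [[1] * 3 for _ in range(3)]

# Filtering: multiply by the mask, then accumulate.
box_sum = window_operator(image, ones, lambda p, m: p * m, sum)
# Morphologic operators with a flat mask: keep the pixel, then take min or max.
eroded = window_operator(image, ones, lambda p, m: p, min)
dilated = window_operator(image, ones, lambda p, m: p, max)

print(box_sum)  # [[54, 63], [90, 99]]
print(eroded)   # [[1, 2], [5, 6]]
print(dilated)  # [[11, 12], [15, 16]]
```

Only the two small functions change between operators; the window traversal and data movement stay the same, which is the part a fixed array of processing elements can provide.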

The ME application is being developed using the same processing unit, but with a double ALU scheme in order to reuse the overlapped data of neighboring macro-blocks.



**Figure 2.** (a) Filtering, (b) Morphologic Operators, (c) 2-level Gaussian pyramid, (d) Matrix Multiplication.

### 5. Conclusion

In this paper, a versatile, modular and scalable hardware architecture was presented. The throughput achieved is about 9 GOPs, which allows real-time operation.
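A rough workload estimate helps put the 9 GOPs figure in context. The calculation below is a back-of-the-envelope sketch, not a measurement: the frame rate of 30 frames per second and the count of two operations (one multiply, one add) per mask element are assumptions, and border handling is ignored.

```python
# Workload of one 7x7 window operator on a 640x480 image.
WIDTH, HEIGHT = 640, 480   # image size used in the tests
MASK = 7                   # 7x7 mask, i.e. 49 elements
OPS_PER_ELEMENT = 2        # assumption: one multiply and one add
FRAME_RATE = 30            # assumption: frames per second
PEAK_OPS = 9e9             # ~9 GOPs peak performance from Table 1

ops_per_frame = WIDTH * HEIGHT * MASK * MASK * OPS_PER_ELEMENT
ops_per_second = ops_per_frame * FRAME_RATE
headroom = PEAK_OPS / ops_per_second

print(ops_per_frame)   # 30105600
print(ops_per_second)  # 903168000
```

Under these assumptions a single 7x7 operator needs roughly 0.9 GOPs, about one tenth of the reported peak, which leaves room for chained operators or for the heavier ME workload.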

The ME application is currently being developed and analyzed, and further optimizations will be implemented to improve performance.

### 6. References

[1] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms", IEEE Trans. CAS, Vol. 36, No. 10, Oct. 1989, pp. 1301-1308.

[2] P. Pirsch, N. Demassieux, W. Gehrke, "VLSI architectures for video compression", Proc. of the IEEE, Vol. 83, No. 2, Feb. 1995, pp. 220-246.

[3] M. Sung, "Algorithms and VLSI architectures for motion estimation", VLSI Implementations for Image Communications, P. Pirsch (Ed.), 1993, pp. 251-2281.

[4] J. Baek et al., "A fast array architecture for block matching algorithm", Proc. of IEEE ISCAS, Vol. 4, 1994, pp. 211-214.